Text-informed speech inpainting
Let’s say that you want to insert new words in the middle of a recorded sentence, possibly removing other words at the same time. The goal of inpainting is to generate the missing audio for the edited sentence so that the result sounds as natural as possible. With perfect inpainting, a listener could not tell which words were generated and which were original.
Surprisingly, little research has addressed text-informed speech inpainting to date, even though similar problems in vision and natural language processing have already been explored, including:
A key difference is that text-informed speech inpainting, where the user edits a sentence to generate audio, combines two data types at the same time: text and audio. In other words, it is audio inpainting conditioned on text.
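To make the "audio inpainting conditioned on text" framing concrete, here is a minimal sketch of how the inputs to such a system could be derived from a user's edit. It assumes that word-level timestamps for the original recording are already available (for example from a forced aligner); the function name `edit_region` and the data layout are hypothetical and chosen only for illustration. The sketch does not generate any audio: it only finds the time span to regenerate and the new words that should fill it.

```python
from difflib import SequenceMatcher


def edit_region(original_words, edited_words, word_times):
    """Locate the audio span to inpaint after a text edit.

    original_words: words of the recorded sentence.
    edited_words:   words of the sentence after the user's edit.
    word_times:     (start_sec, end_sec) per original word
                    (assumed to come from an external aligner).

    Returns (start_sec, end_sec, new_words), or None if nothing changed.
    """
    matcher = SequenceMatcher(a=original_words, b=edited_words, autojunk=False)
    # Keep only the opcodes that describe a change (insert/delete/replace).
    changes = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if not changes:
        return None

    # Merge all changes into a single contiguous region for simplicity.
    i1, i2 = changes[0][1], changes[-1][2]  # span in the original words
    j1, j2 = changes[0][3], changes[-1][4]  # span in the edited words

    # Start of the gap: onset of the first changed word, or the end of the
    # recording when words are appended after the last original word.
    if i1 < len(word_times):
        start = word_times[i1][0]
    else:
        start = word_times[-1][1] if word_times else 0.0

    # End of the gap: offset of the last removed word. For a pure insertion
    # no original audio is removed, so the gap has zero length.
    end = word_times[i2 - 1][1] if i2 > i1 else start

    return start, end, edited_words[j1:j2]


if __name__ == "__main__":
    original = "the cat sat on the mat".split()
    times = [(0.0, 0.2), (0.2, 0.5), (0.5, 0.8), (0.8, 0.9), (0.9, 1.0), (1.0, 1.4)]
    edited = "the dog slept on the mat".split()
    print(edit_region(original, edited, times))
```

An inpainting model would then take the untouched audio on both sides of the returned span as acoustic context and the returned words as the text condition. Merging all changes into one region is a simplification made here; a real system might handle several separate edits independently.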